NSF PAR Search | NSF Public Access Repository

LiveDataLab: A Cloud-based Open Lab for Integrating Big Data Research, Education, and Applications

https://doi.org/10.1109/BigData62323.2024.10825402

Zhai, ChengXiang (December 2024, IEEE)

We present the vision of LiveDataLab and discuss the new research directions and application opportunities it opens up. LiveDataLab is envisioned to be a cloud-based open lab infrastructure where research, education, and application development in big data can be integrated in one unified platform, thus accelerating research, technology transfer, and workforce development in big data.
Full Text Available
Large Language Models and Future of Information Retrieval: Opportunities and Challenges

https://doi.org/10.1145/3626772.3657848

Zhai, ChengXiang (July 2024, ACM)

Recent years have seen great success of large language models (LLMs) in performing many natural language processing tasks with impressive performance, including tasks that directly serve users such as question answering and text summarization. They open up unprecedented opportunities for transforming information retrieval (IR) research and applications. However, concerns such as hallucination undermine their trustworthiness, limiting their actual utility when deployed in real-world applications, especially high-stakes applications where trust is vital. How can we both exploit the strengths of LLMs and mitigate any risk caused by their weaknesses when applying LLMs to IR? What are the best opportunities for us to apply LLMs to IR? What are the major challenges that we will need to address in the future to fully exploit such opportunities? Given the anticipated growth of LLMs, what will future information retrieval systems look like? Will LLMs eventually replace an IR system? In this perspective paper, we examine these questions and provide provisional answers to them. We argue that LLMs will not be able to replace search engines, and future LLMs would need to learn how to use a search engine so that they can interact with a search engine on behalf of users. We conclude with a set of promising future research directions in applying LLMs to IR.
Full Text Available
The Law of Knowledge Overshadowing: Towards Understanding, Predicting and Preventing LLM Hallucination

https://doi.org/10.18653/v1/2025.findings-acl.1199

Zhang, Yuji; Li, Sha; Qian, Cheng; Liu, Jiateng; Yu, Pengfei; Han, Chi; Fung, Yi R; McKeown, Kathleen; Zhai, ChengXiang; Li, Manling; et al (January 2025, Association for Computational Linguistics)

Full Text Available
Large language models for whole-learner support: opportunities and challenges

https://doi.org/10.3389/frai.2024.1460364

Mannekote, Amogh; Davies, Adam; Pinto, Juan D; Zhang, Shan; Olds, Daniel; Schroeder, Noah L; Lehman, Blair; Zapata-Rivera, Diego; Zhai, ChengXiang (October 2024, Frontiers in Artificial Intelligence)

In recent years, large language models (LLMs) have seen rapid advancement and adoption, and are increasingly being used in educational contexts. In this perspective article, we explore the open challenge of leveraging LLMs to create personalized learning environments that support the “whole learner” by modeling and adapting to both cognitive and non-cognitive characteristics. We identify three key challenges toward this vision: (1) improving the interpretability of LLMs' representations of whole learners, (2) implementing adaptive technologies that can leverage such representations to provide tailored pedagogical support, and (3) authoring and evaluating LLM-based educational agents. For interpretability, we discuss approaches for explaining LLM behaviors in terms of their internal representations of learners; for adaptation, we examine how LLMs can be used to provide context-aware feedback and scaffold non-cognitive skills through natural language interactions; and for authoring, we highlight the opportunities and challenges involved in using natural language instructions to specify behaviors of educational agents. Addressing these challenges will enable personalized AI tutors that can enhance learning by accounting for each student's unique background, abilities, motivations, and socioemotional needs.
Full Text Available
AnaDE1.0: A Novel Data Set for Benchmarking Analogy Detection and Extraction

Bhavya, Bhavya; Sehgal, Shradha; Xiong, Jinjun; Zhai, ChengXiang (March 2024, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics)

Textual analogies that make comparisons between two concepts are often used for explaining complex ideas, creative writing, and scientific discovery. In this paper, we propose and study a new task, called Analogy Detection and Extraction (AnaDE), which includes three synergistic sub-tasks: 1) detecting documents containing analogies, 2) extracting text segments that make up the analogy, and 3) identifying the (source and target) concepts being compared. To facilitate the study of this new task, we create a benchmark dataset by scraping Metamia.com and investigate the performance of state-of-the-art models on all sub-tasks to establish the first-generation benchmark results for this new task. We find that the Longformer model achieves the best performance on all three sub-tasks, demonstrating its effectiveness for handling long texts. Moreover, smaller models fine-tuned on our dataset perform better than non-finetuned ChatGPT, suggesting high task difficulty. Overall, the models achieve high performance on document detection, suggesting that they could be used to develop applications like analogy search engines. Further, there is substantial room for improvement on the segment and concept extraction tasks.
Full Text Available
AnaDE1.0: A Novel Data Set for Benchmarking Analogy Detection and Extraction

Bhavya, Bhavya; Sehgal, Shradha; Xiong, Jinjun; Zhai, ChengXiang (March 2024, Association for Computational Linguistics)

Full Text Available
The CDL: An Online Platform for Creating Community-based Digital Libraries

https://doi.org/10.1145/3584931.3607495

Ros, Kevin; Zhai, ChengXiang (October 2023, ACM)

Full Text Available
Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains

https://doi.org/10.1609/AAAI.V38I17.29933

Zhang, Yu; Zhang, Yunyi; Shen, Yanzhen; Deng, Yu; Popa, Lucian; Shwartz, Larisa; Zhai, ChengXiang; Han, Jiawei (March 2024, Proceedings of the AAAI Conference on Artificial Intelligence)
Wooldridge, Michael J; Dy, Jennifer G; Natarajan, Sriraam (Ed.)
Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, not to mention the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType, which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines. Code and data are available at: https://github.com/yuzhimanhua/SEType.
Full Text Available
Retrieving Webpages Using Online Discussions

https://doi.org/10.1145/3578337.3605139

Ros, Kevin; Jin, Matthew; Levine, Jacob; Zhai, ChengXiang (August 2023, ACM)

Full Text Available
KEBLM: Knowledge-Enhanced Biomedical Language Models

https://doi.org/10.1016/j.jbi.2023.104392

Lai, Tuan Manh; Zhai, ChengXiang; Ji, Heng (July 2023, Journal of Biomedical Informatics)

Full Text Available
